class: center, middle, inverse, title-slide .title[ # Introduction to R for Data Analysis ] .subtitle[ ## Data Types, Import & Export ] .author[ ### Johannes Breuer, Stefan Jünger & Veronika Batzdorfer ] .date[ ### 2021-08-02 ] --- layout: true --- ## Getting data into `R` So far, we have learned what `R` and `RStudio` are. This course is about starting to use `R` and feeling prepared to use it for statistical analyses. There's one essential prerequisite: .center[**We need data!**] <img src="data:image/png;base64,#../img/import_data.png" width="50%" style="display: block; margin: auto;" /> --- ## Content of this session - What are `R`'s internal data types? - How to work with different data types? - How to import data in different formats? - How to export data in different formats? --- ## Data we use in this course During the course, we use several different data sets. Especially in this session, where we apply different importing functions, we use quite a few data sets, from data about the Titanic to data about unicorns. However, we will also use data that are more interesting for social and behavioral scientists. --- ## It all boils down to... .pull-left[ **How your data are stored (data types)** - 'Numbers' (Integers & Doubles) - Character Strings - Logical - Factors - ... - There's more, e.g., expressions, but let's leave it at that ] .pull-right[ **Where your data are stored (data formats)** - Vectors - Matrices - Arrays - Data frames / Tibbles - Lists ] .footnote[https://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/DataTypes4.pdf] --- ## Numeric data .small[ *Integers* are whole numbers, i.e., values without a decimal part. To tell `R` explicitly that a value is an integer, you have to place an `L` directly after it (otherwise, `R` stores the number as a double). ```r 1L ``` ``` ## [1] 1 ``` By contrast, *doubles* are values that can have a decimal part. ```r 1.1 ``` ``` ## [1] 1.1 ``` We can check data types by using the `typeof()` function. 
```r typeof(1L) ``` ``` ## [1] "integer" ``` ```r typeof(1.1) ``` ``` ## [1] "double" ``` ] --- ## Character strings At first glance, a *character* is a letter somewhere between a-z. *String* in this context might mean that we have a series of characters. However, numbers and other symbols can be part of a *character string*, which can then be, e.g., part of a text. In `R`, character strings are wrapped in quotation marks. ```r "Hi. I am a character string, the 1st of its kind!" ``` ``` ## [1] "Hi. I am a character string, the 1st of its kind!" ``` *Note*: There are no values associated with the content of character strings unless we change that, e.g., with factors. --- ## Factors If you're a *Stata* (or *SPSS*) user, you may already be familiar with factors. Factors are data types that assume that their values are not continuous, e.g., as in [ordinal](https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale) or [nominal](https://en.wikipedia.org/wiki/Level_of_measurement#Nominal_level) data. ```r factor(1.1) ``` ``` ## [1] 1.1 ## Levels: 1.1 ``` ```r factor("Hi. I am a character string, the 1st of its kind!") ``` ``` ## [1] Hi. I am a character string, the 1st of its kind! ## Levels: Hi. I am a character string, the 1st of its kind! ``` Factors take numeric data or character strings as input as they simply convert them into so-called levels. This concept may be a little bit abstract for the time being. It's just essential to have heard about them before you learn more about them. --- ## Logical values Logical values are basically either `TRUE` or `FALSE` values. These values are produced by making logical requests on your data. ```r 2 > 1 ``` ``` ## [1] TRUE ``` ```r 2 < 1 ``` ``` ## [1] FALSE ``` Logical values are at the heart of creating loops. For this purpose, however, we need more logical operators to request `TRUE` or `FALSE` values. 
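As a small illustration of how such logical requests behave on a whole vector, the sketch below compares every element of a made-up `ages` vector (an assumption for illustration, not part of the course data) against a threshold and checks the type of the result with `typeof()`.

```r
# Hypothetical example values, not taken from the course data
ages <- c(15, 22, 37)

# A comparison on a vector returns one logical value per element
is_older_than_18 <- ages > 18
is_older_than_18          # FALSE TRUE TRUE

# The result is stored as the data type 'logical'
typeof(is_older_than_18)  # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(is_older_than_18)     # 2
```

Because the comparison is applied element by element, the same pattern can later be used to select or count observations that meet a condition.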
--- ## Logical operators There are quite a few logical operators in `R`: .pull-left[ - `<` less than - `<=` less than or equal to - `>` greater than - `>=` greater than or equal to - `==` exactly equal to - `!=` not equal to ] .pull-right[ - `!x` Not x - `x | y` x OR y - `x & y` x AND y - `isTRUE(x)` test if x is TRUE - `isFALSE(x)` test if x is FALSE ] .footnote[https://www.statmethods.net/management/operators.html] Moreover, there are `is.PROPERTY_ASKED_FOR()` functions, such as `is.numeric()`, which also return `TRUE` or `FALSE` values. --- ## `R`'s data formats `R`'s different data types can be put into 'containers'. <img src="data:image/png;base64,#../img/9213.1526125966.png" width="75%" style="display: block; margin: auto;" /> .footnote[https://devopedia.org/r-data-structures] --- ## Vectors Vectors are built by enclosing your content with `c()` ("c" for "concatenate"). ```r numeric_vector <- c(1, 2, 3, 4) character_vector <- c("a", "b", "c", "d") numeric_vector ``` ``` ## [1] 1 2 3 4 ``` ```r character_vector ``` ``` ## [1] "a" "b" "c" "d" ``` Vectors are really like vectors in mathematics. Initially, it doesn't matter if you look at them as column or row vectors. --- ## ...but it matters when you combine vectors Using the functions `cbind()` or `rbind()`, you can combine vectors column-wise or row-wise. The result is a matrix. ```r cbind(numeric_vector, character_vector) ``` ``` ## numeric_vector character_vector ## [1,] "1" "a" ## [2,] "2" "b" ## [3,] "3" "c" ## [4,] "4" "d" ``` ```r rbind(numeric_vector, character_vector) ``` ``` ## [,1] [,2] [,3] [,4] ## numeric_vector "1" "2" "3" "4" ## character_vector "a" "b" "c" "d" ``` .small[ *Note*: The numeric values are [coerced](https://www.oreilly.com/library/view/r-in-a/9781449358204/ch05s08.html) into strings here. ] --- ## Matrices Matrices are the basic rectangular data format in `R`. 
```r fancy_matrix <- matrix(1:16, nrow = 4) fancy_matrix ``` ``` ## [,1] [,2] [,3] [,4] ## [1,] 1 5 9 13 ## [2,] 2 6 10 14 ## [3,] 3 7 11 15 ## [4,] 4 8 12 16 ``` You cannot store multiple data types, such as strings and numeric values, in the same matrix. If you try, your data will be coerced to a common type, as seen on the previous slide. This already happens within vectors: ```r c(1, 2, "evil string") ``` ``` ## [1] "1" "2" "evil string" ``` --- ## Data frames While matrices are used, e.g.,--\*drumroll\*-- for matrix operations, data frames more closely resemble the data formats most of you are probably already familiar with. We can build data frames by hand, as shown here: .tinyish[ ```r library(randomNames) # a name generator package fancy_data <- data.frame( who = randomNames(n = 10, which.names = "first"), age = sample(14:49, 10, replace = TRUE), # you see what we are doing here? salary_2018 = sample(15:100, 10, replace = TRUE), salary_2019 = sample(15:100, 10, replace = TRUE) ) fancy_data ``` ] .right[↪️] --- class: middle ``` ## who age salary_2018 salary_2019 ## 1 Nikolai 44 64 43 ## 2 Abdur Rasheed 17 91 70 ## 3 Reema 38 85 83 ## 4 Claudia 19 72 46 ## 5 Carl 38 100 93 ## 6 Meghan 22 66 38 ## 7 Fidda 49 83 71 ## 8 Brandy 39 46 68 ## 9 Shafee'a 48 30 60 ## 10 Jeffrey 45 15 26 ``` --- ## Tibbles .pull-left[ Tibbles are basically just `R data.frames` but nicer. - only the first ten observations are printed - the output is tidier! - you get some additional metadata about rows and columns that you would normally only get when using `dim()` and other functions You can check the [tibble vignette](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) for technical details. 
] .pull-right[ <img src="data:image/png;base64,#../img/tibble.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Tibble conversion ```r library(tibble) as_tibble(fancy_data) ``` ``` ## # A tibble: 10 × 4 ## who age salary_2018 salary_2019 ## <chr> <int> <int> <int> ## 1 Nikolai 44 64 43 ## 2 Abdur Rasheed 17 91 70 ## 3 Reema 38 85 83 ## 4 Claudia 19 72 46 ## 5 Carl 38 100 93 ## 6 Meghan 22 66 38 ## 7 Fidda 49 83 71 ## 8 Brandy 39 46 68 ## 9 Shafee'a 48 30 60 ## 10 Jeffrey 45 15 26 ``` --- ## One last type you should know: lists Lists are perfect for storing numerous and potentially diverse pieces of information in one place. ```r fancy_list <- list( numeric_vector, character_vector, fancy_matrix, fancy_data ) fancy_list ``` .right[↪️] --- class: middle .tinyish[ ``` ## [[1]] ## [1] 1 2 3 4 ## ## [[2]] ## [1] "a" "b" "c" "d" ## ## [[3]] ## [,1] [,2] [,3] [,4] ## [1,] 1 5 9 13 ## [2,] 2 6 10 14 ## [3,] 3 7 11 15 ## [4,] 4 8 12 16 ## ## [[4]] ## who age salary_2018 salary_2019 ## 1 Nikolai 44 64 43 ## 2 Abdur Rasheed 17 91 70 ## 3 Reema 38 85 83 ## 4 Claudia 19 72 46 ## 5 Carl 38 100 93 ## 6 Meghan 22 66 38 ## 7 Fidda 49 83 71 ## 8 Brandy 39 46 68 ## 9 Shafee'a 48 30 60 ## 10 Jeffrey 45 15 26 ``` ] --- ## Nested lists ```r fancy_nested_list <- list( fancy_vectors = list(numeric_vector, character_vector), data_stuff = list(fancy_matrix, fancy_data) ) fancy_nested_list ``` .right[↪️] --- class: middle .tinyish[ ``` ## $fancy_vectors ## $fancy_vectors[[1]] ## [1] 1 2 3 4 ## ## $fancy_vectors[[2]] ## [1] "a" "b" "c" "d" ## ## ## $data_stuff ## $data_stuff[[1]] ## [,1] [,2] [,3] [,4] ## [1,] 1 5 9 13 ## [2,] 2 6 10 14 ## [3,] 3 7 11 15 ## [4,] 4 8 12 16 ## ## $data_stuff[[2]] ## who age salary_2018 salary_2019 ## 1 Nikolai 44 64 43 ## 2 Abdur Rasheed 17 91 70 ## 3 Reema 38 85 83 ## 4 Claudia 19 72 46 ## 5 Carl 38 100 93 ## 6 Meghan 22 66 38 ## 7 Fidda 49 83 71 ## 8 Brandy 39 46 68 ## 9 Shafee'a 48 30 60 ## 10 Jeffrey 45 15 26 ``` ] --- ## Accessing elements 
by index Generally, the logic of `[index_number]` is used in `R` to access only a subset of information in an object, no matter if we have vectors or data frames. Say we want to extract the second element of our `character_vector` object. We could do that like this: ```r character_vector[2] ``` ``` ## [1] "b" ``` --- ## More complicated cases: matrices Matrices have two dimensions (rows and columns), and often you want information from a specific row and column. ```r a_wonderful_matrix[number_of_row, number_of_column] ``` *Note*: You can do the same indexing with `data.frame`s. We will talk more about this in the session on *Data Wrangling Basics*. --- ## Matrices and subscripts (as in mathematical notation) Identifying rows, columns, or elements using subscripts is similar to matrix notation: ```r fancy_matrix[, 4] # 4th column of matrix fancy_matrix[3,] # 3rd row of matrix fancy_matrix[2:4, 1:3] # rows 2,3,4 of columns 1,2,3 ``` It's really like in math, and you can perform standard mathematical operations, such as matrix multiplications. ```r fancy_matrix[2:4, 1:3] %*% fancy_matrix[1:3, 2:4] ``` ``` ## [,1] [,2] [,3] ## [1,] 116 188 260 ## [2,] 134 218 302 ## [3,] 152 248 344 ``` --- ## The case of data frames A nice feature of `data.frames` or `tibbles` is that their columns are named, just like variables in ordinary data sets. It would be cumbersome to use index numbers to extract a specific column/variable, right? Do not fear: ```r fancy_data$who ``` ``` ## [1] "Nikolai" "Abdur Rasheed" "Reema" "Claudia" "Carl" "Meghan" "Fidda" "Brandy" "Shafee'a" ## [10] "Jeffrey" ``` Just place a `$`-sign between the data object and the variable name. --- ## `[]` in data frames Sometimes we also have to rely on character strings as input information, e.g., for iterating over data. We can also use `[]` to access variables by name. 
.pull-left[ Not only this way: ```r fancy_data[1] ``` ``` ## who ## 1 Nikolai ## 2 Abdur Rasheed ## 3 Reema ## 4 Claudia ## 5 Carl ## 6 Meghan ## 7 Fidda ## 8 Brandy ## 9 Shafee'a ## 10 Jeffrey ``` ] .pull-right[ But also this way: ```r fancy_data["who"] ``` ``` ## who ## 1 Nikolai ## 2 Abdur Rasheed ## 3 Reema ## 4 Claudia ## 5 Carl ## 6 Meghan ## 7 Fidda ## 8 Brandy ## 9 Shafee'a ## 10 Jeffrey ``` ] --- ## Difference between `[]` and `[[]]` https://twitter.com/hadleywickham/status/643381054758363136 --- ## Data frame check 1, 2, 1, 2! Once you start working with data in `R` a good first thing to do is to have a quick look at them. The most high-level information you can get is about the object type and its dimensions. .small[ ```r # object type class(fancy_data) ``` ``` ## [1] "data.frame" ``` ```r # number of rows and columns dim(fancy_data) ``` ``` ## [1] 10 4 ``` ```r # number of rows nrow(fancy_data) ``` ``` ## [1] 10 ``` ```r # number of columns ncol(fancy_data) ``` ``` ## [1] 4 ``` ] --- ## Data frame check 1, 2, 1, 2! You can also print the first 6 lines of the data frame with `head()`. You can easily change the number of lines by providing the number as the second argument to the `head()` function. ```r head(fancy_data, 3) ``` ``` ## who age salary_2018 salary_2019 ## 1 Nikolai 44 64 43 ## 2 Abdur Rasheed 17 91 70 ## 3 Reema 38 85 83 ``` --- ## Data frame check 1, 2, 1, 2! If we want some more (detailed) information about the data set or object, we can use the `base R` function `str()`. ```r str(fancy_data) ``` ``` ## 'data.frame': 10 obs. of 4 variables: ## $ who : chr "Nikolai" "Abdur Rasheed" "Reema" "Claudia" ... ## $ age : int 44 17 38 19 38 22 49 39 48 45 ## $ salary_2018: int 64 91 85 72 100 66 83 46 30 15 ## $ salary_2019: int 43 70 83 46 93 38 71 68 60 26 ``` --- ## Data frame check 1, 2, 1, 2! If you want to have a look at your full data set, you can use the `View()` function. 
In *RStudio*, this will open a new tab in the source pane through which you can explore the data set (including a search function). You can also click on the small spreadsheet symbol on the right side of the object in the environment tab to open this view. ```r View(fancy_data) ``` <img src="data:image/png;base64,#../img/rstudio_view.png" width="65%" style="display: block; margin: auto;" /> --- ## Viewing and changing names We can print all names of an object using the `names()` function... ```r names(fancy_data) ``` ``` ## [1] "who" "age" "salary_2018" "salary_2019" ``` ...and we can also change names with it. ```r names(fancy_data) <- c("name", "age", "salary_2018", "salary_2019") names(fancy_data) ``` ``` ## [1] "name" "age" "salary_2018" "salary_2019" ``` However, there are more flexible ways of doing this as we will see in the session on *Data Wrangling Basics* tomorrow. --- class: center, middle # [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_1_2_1_Data_Types.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_1_2_1_Data_Types.html) --- ## German General Social Survey 2021 (GGSS/ALLBUS) .left-column[ <img src="data:image/png;base64,#../img/allbus.png" width="400" style="display: block; margin: auto;" /> ] .right-column[ For most of the examples and exercises in this course we will use data from the [German General Social Survey 2021 (GGSS/ALLBUS)](https://www.gesis.org/en/institute/research-data-centers/rdc-allbus). You can [download the data set in different formats as well as the codebook and the questionnaire (in German) from the *GESIS* website](https://search.gesis.org/research_data/ZA5280) (note: you need to have/create a user account). Theresa also prepared a subset documentation of variables that might be interesting for you in the `./data` folder. The *GGSS* website provides [detailed documentation](https://www.gesis.org/en/allbus/contents-search). 
] --- ## Gapminder Data .left-column[ <img src="data:image/png;base64,#../img/gapminder_logo.png" width="1200" style="display: block; margin: auto;" /> ] .right-column[ We will also use [data from *Gapminder*](https://www.gapminder.org/data/). During the course and the exercises, we work with data we have downloaded from their website. There also is an `R` package that bundles some of the *Gapminder* data: `install.packages("gapminder")`. This `R` package provides ["[a]n excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007."](https://cran.r-project.org/web/packages/gapminder/index.html) ] --- ## How to use the data in general To code along and be able to do the exercises, you should store the data files for the *GGSS 2021* in a folder called `./data` in the same folder as the other materials for this course. --- ## `R` is data-agnostic <img src="data:image/png;base64,#../img/Datenimport.PNG" width="65%" style="display: block; margin: auto;" /> --- ## Data formats & packages .pull-left[ **What you will learn** - Getting the most common data formats into `R` - e.g., CSV, *Stata*, *SPSS*, or *Excel* spreadsheets - Using the different methods of doing that - We will rely a lot on packages and functions from the `tidyverse` instead of using `base R` ] .pull-right[ **What you won't learn** - Getting old & obscure binary data formats into `R` - ... although [that is possible](https://cran.r-project.org/doc/manuals/r-release/R-data.html) ] --- ## Before writing any code: *RStudio* functionality for importing data You can use the *RStudio* GUI for importing data via `Environment - Import data set - Choose file type`. 
<img src="data:image/png;base64,#../img/rstudio_import.PNG" width="716" style="display: block; margin: auto;" /> --- ## Where to find data **Browse Button in `RStudio`** <img src="data:image/png;base64,#../img/importBrowse.PNG" width="75%" style="display: block; margin: auto;" /> **Code preview in `RStudio`** <img src="data:image/png;base64,#../img/codepreview.PNG" width="75%" style="display: block; margin: auto;" /> --- ## Honestly, after some time you will write the code directly .center[ <img src="data:image/png;base64,#../img/coding_cat.gif" style="display: block; margin: auto;" /> .footnote[[Source](https://media.giphy.com/media/LmNwrBhejkK9EFP504/source.gif)] ] --- ## Honestly, after some time you will write the code directly .center[ <img src="data:image/png;base64,#../img/hadley-typing.gif" style="display: block; margin: auto;" /> [Source](https://tenor.com/view/hadley-wickham-rstats-typing-rcode-gif-11365139) ] --- ## Simple vs. not so simple file formats Basic file formats, such as CSV (comma-separated values), can be imported directly into `R` - they are 'flat' - little metadata - basically text files Other file formats, particularly the proprietary ones, require the use of additional packages - they are complex - a lot of metadata (think of all the labels in an *SPSS* file) - they are binary (1110101) --- ## File format wars <img src="data:image/png;base64,#../img/norm_normal_file_format.png" width="30%" style="display: block; margin: auto;" /> https://xkcd.com/2116/ --- ## Disclaimer **In the following slides, we'll jump right into importing data. We use a lot of different packages for this purpose, and you don't have to remember everything. The point is just to show how agnostic `R` actually is regarding the file type. 
Later on, we will dive more into the specifics of importing.** --- ## Importing a CSV file using `base R` ```r titanic <- read.csv("./data/titanic.csv") titanic ``` .tinyish[ ``` ## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ## 1 1 0 3 Braund, Mr. Owen Harris male 22.00 1 0 A/5 21171 7.2500 ## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.00 1 0 PC 17599 71.2833 ## 3 3 1 3 Heikkinen, Miss. Laina female 26.00 0 0 STON/O2. 3101282 7.9250 ## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.00 1 0 113803 53.1000 ## 5 5 0 3 Allen, Mr. William Henry male 35.00 0 0 373450 8.0500 ## 6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 ## 7 7 0 1 McCarthy, Mr. Timothy J male 54.00 0 0 17463 51.8625 ## 8 8 0 3 Palsson, Master. Gosta Leonard male 2.00 3 1 349909 21.0750 ## 9 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.00 0 2 347742 11.1333 ## 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.00 1 0 237736 30.0708 ## 11 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 PP 9549 16.7000 ## 12 12 1 1 Bonnell, Miss. Elizabeth female 58.00 0 0 113783 26.5500 ## 13 13 0 3 Saundercock, Mr. William Henry male 20.00 0 0 A/5. 2151 8.0500 ## 14 14 0 3 Andersson, Mr. Anders Johan male 39.00 1 5 347082 31.2750 ## 15 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.00 0 0 350406 7.8542 ## 16 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.00 0 0 248706 16.0000 ## 17 17 0 3 Rice, Master. Eugene male 2.00 4 1 382652 29.1250 ## 18 18 1 2 Williams, Mr. Charles Eugene male NA 0 0 244373 13.0000 ## 19 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) female 31.00 1 0 345763 18.0000 ## 20 20 1 3 Masselmani, Mrs. Fatima female NA 0 0 2649 7.2250 ## 21 21 0 2 Fynney, Mr. Joseph J male 35.00 0 0 239865 26.0000 ## 22 22 1 2 Beesley, Mr. Lawrence male 34.00 0 0 248698 13.0000 ## 23 23 1 3 McGowan, Miss. Anna "Annie" female 15.00 0 0 330923 8.0292 ## 24 24 1 1 Sloper, Mr. 
William Thompson male 28.00 0 0 113788 35.5000 ## 25 25 0 3 Palsson, Miss. Torborg Danira female 8.00 3 1 349909 21.0750 ## 26 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) female 38.00 1 5 347077 31.3875 ## 27 27 0 3 Emir, Mr. Farred Chehab male NA 0 0 2631 7.2250 ## 28 28 0 1 Fortune, Mr. Charles Alexander male 19.00 3 2 19950 263.0000 ## 29 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NA 0 0 330959 7.8792 ## 30 30 0 3 Todoroff, Mr. Lalio male NA 0 0 349216 7.8958 ## 31 31 0 1 Uruchurtu, Don. Manuel E male 40.00 0 0 PC 17601 27.7208 ## 32 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NA 1 0 PC 17569 146.5208 ## 33 33 1 3 Glynn, Miss. Mary Agatha female NA 0 0 335677 7.7500 ## 34 34 0 2 Wheadon, Mr. Edward H male 66.00 0 0 C.A. 24579 10.5000 ## 35 35 0 1 Meyer, Mr. Edgar Joseph male 28.00 1 0 PC 17604 82.1708 ## 36 36 0 1 Holverson, Mr. Alexander Oskar male 42.00 1 0 113789 52.0000 ## 37 37 1 3 Mamee, Mr. Hanna male NA 0 0 2677 7.2292 ## 38 38 0 3 Cann, Mr. Ernest Charles male 21.00 0 0 A./5. 2152 8.0500 ## 39 39 0 3 Vander Planke, Miss. Augusta Maria female 18.00 2 0 345764 18.0000 ## 40 40 1 3 Nicola-Yarred, Miss. Jamila female 14.00 1 0 2651 11.2417 ## 41 41 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.00 1 0 7546 9.4750 ## 42 42 0 2 Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott) female 27.00 1 0 11668 21.0000 ## 43 43 0 3 Kraeff, Mr. Theodor male NA 0 0 349253 7.8958 ## 44 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.00 1 2 SC/Paris 2123 41.5792 ## 45 45 1 3 Devaney, Miss. Margaret Delia female 19.00 0 0 330958 7.8792 ## 46 46 0 3 Rogers, Mr. William John male NA 0 0 S.C./A.4. 23567 8.0500 ## 47 47 0 3 Lennon, Mr. Denis male NA 1 0 370371 15.5000 ## 48 48 1 3 O'Driscoll, Miss. Bridget female NA 0 0 14311 7.7500 ## 49 49 0 3 Samaan, Mr. Youssef male NA 2 0 2662 21.6792 ## 50 50 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.00 1 0 349237 17.8000 ## 51 51 0 3 Panula, Master. 
Juha Niilo male 7.00 4 1 3101295 39.6875 ## 52 52 0 3 Nosworthy, Mr. Richard Cater male 21.00 0 0 A/4. 39886 7.8000 ## 53 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.00 1 0 PC 17572 76.7292 ## 54 54 1 2 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson) female 29.00 1 0 2926 26.0000 ## 55 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.00 0 1 113509 61.9792 ## 56 56 1 1 Woolner, Mr. Hugh male NA 0 0 19947 35.5000 ## 57 57 1 2 Rugg, Miss. Emily female 21.00 0 0 C.A. 31026 10.5000 ## 58 58 0 3 Novel, Mr. Mansouer male 28.50 0 0 2697 7.2292 ## 59 59 1 2 West, Miss. Constance Mirium female 5.00 1 2 C.A. 34651 27.7500 ## 60 60 0 3 Goodwin, Master. William Frederick male 11.00 5 2 CA 2144 46.9000 ## 61 61 0 3 Sirayanian, Mr. Orsen male 22.00 0 0 2669 7.2292 ## 62 62 1 1 Icard, Miss. Amelie female 38.00 0 0 113572 80.0000 ## 63 63 0 1 Harris, Mr. Henry Birkhardt male 45.00 1 0 36973 83.4750 ## 64 64 0 3 Skoog, Master. Harald male 4.00 3 2 347088 27.9000 ## 65 65 0 1 Stewart, Mr. Albert A male NA 0 0 PC 17605 27.7208 ## 66 66 1 3 Moubarek, Master. Gerios male NA 1 1 2661 15.2458 ## 67 67 1 2 Nye, Mrs. (Elizabeth Ramell) female 29.00 0 0 C.A. 29395 10.5000 ## 68 68 0 3 Crease, Mr. Ernest James male 19.00 0 0 S.P. 3464 8.1583 ## 69 69 1 3 Andersson, Miss. Erna Alexandra female 17.00 4 2 3101281 7.9250 ## 70 70 0 3 Kink, Mr. Vincenz male 26.00 2 0 315151 8.6625 ## 71 71 0 2 Jenkin, Mr. Stephen Curnow male 32.00 0 0 C.A. 33111 10.5000 ## 72 72 0 3 Goodwin, Miss. Lillian Amy female 16.00 5 2 CA 2144 46.9000 ## 73 73 0 2 Hood, Mr. Ambrose Jr male 21.00 0 0 S.O.C. 14879 73.5000 ## 74 74 0 3 Chronopoulos, Mr. Apostolos male 26.00 1 0 2680 14.4542 ## 75 75 1 3 Bing, Mr. Lee male 32.00 0 0 1601 56.4958 ## 76 76 0 3 Moen, Mr. Sigurd Hansen male 25.00 0 0 348123 7.6500 ## 77 77 0 3 Staneff, Mr. Ivan male NA 0 0 349208 7.8958 ## 78 78 0 3 Moutal, Mr. Rahamin Haim male NA 0 0 374746 8.0500 ## 79 79 1 2 Caldwell, Master. 
Alden Gates male 0.83 0 2 248738 29.0000 ## 80 80 1 3 Dowdell, Miss. Elizabeth female 30.00 0 0 364516 12.4750 ## 81 81 0 3 Waelens, Mr. Achille male 22.00 0 0 345767 9.0000 ## 82 82 1 3 Sheerlinck, Mr. Jan Baptist male 29.00 0 0 345779 9.5000 ## 83 83 1 3 McDermott, Miss. Brigdet Delia female NA 0 0 330932 7.7875 ## Cabin Embarked ## 1 S ## 2 C85 C ## 3 S ## 4 C123 S ## 5 S ## 6 Q ## 7 E46 S ## 8 S ## 9 S ## 10 C ## 11 G6 S ## 12 C103 S ## 13 S ## 14 S ## 15 S ## 16 S ## 17 Q ## 18 S ## 19 S ## 20 C ## 21 S ## 22 D56 S ## 23 Q ## 24 A6 S ## 25 S ## 26 S ## 27 C ## 28 C23 C25 C27 S ## 29 Q ## 30 S ## 31 C ## 32 B78 C ## 33 Q ## 34 S ## 35 C ## 36 S ## 37 C ## 38 S ## 39 S ## 40 C ## 41 S ## 42 S ## 43 C ## 44 C ## 45 Q ## 46 S ## 47 Q ## 48 Q ## 49 C ## 50 S ## 51 S ## 52 S ## 53 D33 C ## 54 S ## 55 B30 C ## 56 C52 S ## 57 S ## 58 C ## 59 S ## 60 S ## 61 C ## 62 B28 ## 63 C83 S ## 64 S ## 65 C ## 66 C ## 67 F33 S ## 68 S ## 69 S ## 70 S ## 71 S ## 72 S ## 73 S ## 74 C ## 75 S ## 76 F G73 S ## 77 S ## 78 S ## 79 S ## 80 S ## 81 S ## 82 S ## 83 Q ## [ reached 'max' / getOption("max.print") -- omitted 808 rows ] ``` ] --- ## A `readr` example: `CSV` files ```r library(readr) titanic <- read_csv("./data/titanic.csv") ``` --- class: middle .tinyish[ ```r titanic ``` ``` ## # A tibble: 891 × 12 ## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked ## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> ## 1 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 <NA> S ## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.3 C85 C ## 3 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.92 <NA> S ## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S ## 5 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 <NA> S ## 6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.46 <NA> Q ## 7 7 0 1 McCarthy, Mr. 
Timothy J male 54 0 0 17463 51.9 E46 S ## 8 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.1 <NA> S ## 9 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1 <NA> S ## 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.1 <NA> C ## # … with 881 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` ] Note the column specifications: `readr` 'guesses' them based on the first 1000 observations (we will come back to this later). --- ## Importing *Excel* files with `readxl` ```r library(readxl) unicorns <- read_xlsx("./data/observations.xlsx") ``` No output ☹️ --- class: middle ```r unicorns ``` ``` ## # A tibble: 42 × 3 ## countryname year pop ## <chr> <dbl> <dbl> ## 1 Austria 1670 85 ## 2 Austria 1671 83 ## 3 Austria 1674 75 ## 4 Austria 1675 82 ## 5 Austria 1676 79 ## 6 Austria 1677 70 ## 7 Austria 1678 81 ## 8 Austria 1680 80 ## 9 France 1673 70 ## 10 France 1674 79 ## # … with 32 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- ## *Stata* files with `haven` ```r library(haven) allbus_2021_stata <- read_stata("./data/allbus_2021/ZA5280_v1-0-0.dta") allbus_2021_stata ``` .right[↪️] --- class: middle ``` ## za_nr doi version respid substudy mode splt21 eastwest german ep01 ep03 ep04 ep06 lm01 lm02 lm19 lm20 lm21 lm22 lm14 ## 1 5280 doi:10.4232/1.13954 1.0.0 (2022-07-13) 1 1 4 2 1 1 3 3 3 4 2 210 1 2 1 0.5 1 ## xr19 xr20 lm27 lm28 lm29 lm30 lm31 lm32 lm33 lm34 lm35 lm36 lm37 lm38 lm39 la01 id02 id01 mi05 mi06 mi07 mi08 mi09 mi10 mi11 sex mborn yborn age ## 1 1 1 1 0 0 1 0 0 0 0 2 1 2 2 3 3 2 3 1 1 2 1 1 1 1 2 10 1966 54 ## agec dn07 dm02 dm02c dm03 dg10 dg03 dm06 dn01 dn02 dn04 dn05 ma01b ma02 ma03 ma04 mc01 mc02 mc03 mc04 pn11 fr07 fr08 fr03b fr04b fr05b fr09 fr10 ## 1 3 1 -10 -10 -10 1 4 -10 0 -10 1 1 -11 -11 -11 -11 -11 -11 -11 -11 -11 2 3 2 3 2 3 4 ## fr11 fr12 fe13 fe14 fe15 fe16 fe17 ja01 ja02 ja03 ja04 ja05 ja06 ja07 ja08 ja09 ja10 ja11 lp03 lp04 lp05 lp06 vm08 vm09 vm10 vm11 vm12 vm13 
vm14 ## 1 2 1 5 3 1 4 2 7 5 3 2 6 6 5 2 5 5 5 1 2 1 1 2 3 2 3 3 2 3 ## vm15 st01 pt01 pt02 pt03 pt04 pt06 pt07 pt08 pt09 pt10 pt11 pt12 pt14 pt15 pt19 pt20 ca24 cf03 im01 im17 im18 im19 im20 im21 iw04 pd11 pi07 pi01 ## 1 2 3 7 7 4 2 3 7 4 2 4 7 6 3 3 5 5 3 3 2 2 1 4 3 4 1 2 2 1 ## pi02 pc01 pc02 pc03 pc04 pc05 pc06 pc07 pc08 pc09 pc10 pc11 pc17 pc19 pc20 pa02a va01 va02 va03 va04 ingle pa01 ps03 ca01 ca02 ca03 ca04 ca05 ca06 ## 1 3 3 3 4 2 2 4 2 4 4 2 3 2 4 2 3 4 1 3 2 1 4 2 1 1 2 2 2 3 ## ca07 ca08 ca09 ca10 ca11 ca25 ca26 ca27 ca28 ca29 ca30 ca15 ca16 ca17 ca18 ca34 ca31 ca35 ca36 cs01 cs02 cs03 cs04 cs05 cs06 cs08 cs09 cp01 cp02 ## 1 2 1 1 1 1 1 1 3 2 4 3 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 2 4 2 ## cp03 cp04 ce01 ce02 ca22 ca23 ca32 ca33 educ de06 de07 de08 de09 de10 de12 de11 de13 de14 de15 de16 de05 de18 de17 isced97 iscd11 work dw01 dw02 ## 1 4 3 -11 1 2 1 1 4 5 0 0 1 1 0 0 0 1 0 0 0 0 -10 -10 5 5 1 5 51 ## isco88 siops88 isei88 isco08 siops08 isei08 eseg dw07 dw15 dw10 dw16 dw17 dw18 dw19 dw19c dw37 dw03 dw12 dw12a dw12b dw01a dw02a isco88a siops88a ## 1 4223 38 52 5244 26 38.88 71 2 35 2 1 -10 2 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 ## isei88a isco08a siops08a isei08a dw20 dw22 dw23 dw23c hs01 hs04 hs05 hs06 hs07 hs08 hs09 lp09 lp10 lp11 lp12 rb07 rd01 rd02 rd03 rp01 rp02 mj01 ## 1 -10 -10 -10 -10 -10 -10 -10 -10 4 3 2 3 3 2 5 3 4 4 3 10 1 -10 -10 4 -10 -11 ## mj02 mj03 mj04 mj05 mj06 mm01 mm02 mm03 mm04 mm05 mm06 mstat scmborn scyborn scage scagec sceduc scde06 scde07 scde08 scde09 scde10 scde12 scde11 ## 1 -11 -11 -11 -11 -11 -11 -11 -11 -11 -11 -11 5 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 ## scde13 scde14 scde15 scde16 scde05 scde17 scde18 sciscd97 sciscd11 scwork scdw01 scdw02 scisco88 scsiop88 scisei88 scisco08 scsiop08 scisei08 ## 1 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 ## sceseg scdw07 scdw03 dp01 dp03 pmborn pyborn page pagec peduc pde06 pde07 pde08 pde09 pde10 pde12 pde11 pde13 pde14 pde15 pde16 pde05 
pde17 pde18 ## 1 -10 -10 -10 1 2 12 1965 55 4 5 0 0 1 1 0 0 0 0 1 0 0 0 3 -10 ## pisced97 piscd11 pwork pdw01 pdw02 pisco88 psiops88 pisei88 pisco08 psiops08 pisei08 peseg pdw07 pdw03 fdm01 mdm01 df44 fdw01 fdw02 fisco88 ## 1 5 7 1 5 52 3416 49 50 3323 48.86 56.35 33 2 -10 996 0 1 3 22 7412 ## fsiops88 fisei88 fisco08 fsiops08 fisei08 feseg mdw01 mdw02 misco88 msiops88 misei88 misco08 msiops08 misei08 meseg feduc meduc fde01 mde01 ## 1 33 31 3122 45.94 40.54 42 3 22 3419 46 55 5223 42.78 28.48 42 2 2 9 6 ## fiscd975 miscd975 di01a di02a incc dh01 dh11 dh04 dh09 hh2kin hh2sex hh2mborn hh2yborn hh2age hh2mstat hh3kin hh3sex hh3mborn hh3yborn hh3age ## 1 5 3 -15 10 10 2 -10 1 1 -10 -10 -15 -10 -10 -10 -10 -10 -15 -10 -10 ## hh3mstat hh4kin hh4sex hh4mborn hh4yborn hh4age hh4mstat hh5kin hh5sex hh5mborn hh5yborn hh5age hh5mstat hh6kin hh6sex hh6mborn hh6yborn hh6age ## 1 -10 -10 -10 -15 -10 -10 -10 -10 -10 -15 -10 -10 -10 -10 -10 -15 -10 -10 ## hh6mstat hh7kin hh7sex hh7mborn hh7yborn hh7age hh7mstat hh8kin hh8sex hh8mborn hh8yborn hh8age hh8mstat dh12 dh13 dh14 dh15 dh16 dh17 fh01 fh02 ## 1 -10 -10 -10 -15 -10 -10 -10 -10 -10 -15 -10 -10 -10 41 141 3 10 0 54 -10 -10 ## fh03 fh04 fh05 fh06 fh07 fh08 fh09 fh10 fh11 di01b di02b di05 di06 hhincc dk05 dk06 kh1sex kh1yborn kh1age kh2sex kh2yborn kh2age kh3sex kh3yborn ## 1 -10 -10 -10 -10 -10 -10 -10 -10 -10 -15 -10 -15 10 10 2 -10 -10 -10 -10 -10 -10 -10 -10 -10 ## kh3age kh4sex kh4yborn kh4age kh5sex kh5yborn kh5age kh6sex kh6yborn kh6age kh7sex kh7yborn kh7age kh8sex kh8yborn kh8age aq01 xh03 gs01 gd01 gd02 ## 1 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 4 2 4 -11 -32 ## dg13 dg08 dg09 dg11 cf01 cf04 cf05 cf06 cf07 cf08 cf09 cf10 cf11 pn12 pn16 pn17 mp16 mp17 mp18 mp19 hp01 hp02 hp03 hp04 hp05 hp06 hp07 hp08 sm01 ## 1 -11 -11 -11 -11 -11 3 4 3 4 4 4 4 4 3 2 2 3 3 4 4 -11 -11 -11 -11 -11 -11 -11 -11 2 ## sm02 sm03 pv01 ls01 xs14 xs01 xs02 xs03 xs04 xs05 xs06 xs11 xt01 xt02 xt03 xt04 xt05 xt06 xt12 
xt13 xt14 xt07 xt08 xt09 xt10 xt10c land
## 1 2 2 3 8 1 1 0 0 0 0 -10 22 18 6 20210618 14 25 14.25 18 6 20210618 15 0 15 35 1 80
## bik gkpol wghtpew wghtht wghthew wghthtew
## 1 3 3 1.247175 1.685751 1.235898 2.083416
## [ reached 'max' / getOption("max.print") -- omitted 5341 rows ]
```

---

## *SPSS* files with `haven`

The `haven` package also provides the function `read_spss()` for importing *SPSS* files. It can handle *SPSS*-defined missing values if you set the option `user_na = TRUE` (the default is `FALSE`).

*Note*: The [`sjlabelled` package](https://cran.r-project.org/web/packages/sjlabelled/index.html) can also be used for [working with user-defined missings from *SPSS* files](https://cran.r-project.org/web/packages/sjlabelled/vignettes/intro_sjlabelled.html).

**We will come back to *Stata* and *SPSS* files in a bit as they come with a specific kind of data in `R`: labelled data.**

---

## Other data import options

These were just a few first examples of applying functions for data import from the different packages. There are many more...

.pull-left[
`readr`

- `read_csv()`
- `read_tsv()`
- `read_delim()`
- `read_fwf()`
- `read_table()`
- `read_log()`
]

.pull-right[
`haven`

- `read_sas()`
- `read_spss()`
- `read_stata()`
]

Not to mention all the helper functions and options. For example, we can define the cells to read from an *Excel* file by specifying the option `range = "C1:E4"` in `read_excel()`.

---

## Data type specifications for `tibbles`

- characters
  - indicated by `<chr>`
  - specified by `col_character()`
- integers
  - indicated by `<int>`
  - specified by `col_integer()`
- doubles
  - indicated by `<dbl>`
  - specified by `col_double()`
- factors
  - indicated by `<fct>`
  - specified by `col_factor()`
- logical
  - indicated by `<lgl>`
  - specified by `col_logical()`

---

## Changing variable types

As mentioned before, `read_csv()` 'guesses' the variable types by inspecting a limited number of observations (by default at most 1000; see the `guess_max` argument).

**NB**: This can go wrong!
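A minimal sketch of a guess that is technically valid but probably not what we want. The toy file and its column names (`id`, `pclass`) are made up for illustration. The class variable is categorical, but because it looks numeric, it is read as a number.

```r
library(readr)

# Hypothetical toy data, written to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("id,pclass", "1,3", "2,1", "3,3"), path)

# No col_types are given, so read_csv() has to guess them
toy <- read_csv(path)

# Inspect the column specification that was guessed
spec(toy)

# Check how the categorical variable is stored
typeof(toy$pclass)
```

Calling `spec()` right after an import is a cheap way to spot such cases before they cause trouble in later analyses.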
Luckily, we can change the variable type...

- before/while loading the data
- and after loading the data

---

## While loading the data in `read_csv`

```r
titanic <- read_csv(
  "./data/titanic.csv",
  col_types = cols(
    PassengerId = col_double(),
    Survived = col_double(),
    Pclass = col_double(),
    Name = col_character(),
    Sex = col_character(),
    Age = col_double(),
    SibSp = col_double(),
    Parch = col_double(),
    Ticket = col_character(),
    Fare = col_double(),
    Cabin = col_character(),
    Embarked = col_character()
  )
)

titanic
```

.right[↪️]

---

class: middle

```
## # A tibble: 891 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 <NA> S
## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.3 C85 C
## 3 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.92 <NA> S
## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S
## 5 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 <NA> S
## 6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.46 <NA> Q
## 7 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.9 E46 S
## 8 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.1 <NA> S
## 9 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1 <NA> S
## 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.1 <NA> C
## # … with 881 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

## While loading the data in `read_csv`

```r
titanic <- read_csv(
  "./data/titanic.csv",
  col_types = cols(
    PassengerId = col_double(),
    Survived = col_double(),
    Pclass = col_double(),
    Name = col_character(),
    Sex = col_factor(), # This one changed!
    Age = col_double(),
    SibSp = col_double(),
    Parch = col_double(),
    Ticket = col_character(),
    Fare = col_double(),
    Cabin = col_character(),
    Embarked = col_character()
  )
)

titanic
```

.right[↪️]

---

class: middle

```
## # A tibble: 891 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## <dbl> <dbl> <dbl> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 <NA> S
## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.3 C85 C
## 3 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.92 <NA> S
## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S
## 5 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 <NA> S
## 6 6 0 3 Moran, Mr. James male NA 0 0 330877 8.46 <NA> Q
## 7 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.9 E46 S
## 8 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.1 <NA> S
## 9 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1 <NA> S
## 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.1 <NA> C
## # … with 881 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

## After loading the data

```r
# Note: type_convert() only re-parses character columns
titanic <- type_convert(
  titanic,
  col_types = cols(
    PassengerId = col_double(),
    Survived = col_double(),
    Pclass = col_double(),
    Name = col_character(),
    Sex = col_factor(),
    Age = col_double(),
    SibSp = col_double(),
    Parch = col_double(),
    Ticket = col_character(),
    Fare = col_double(),
    Cabin = col_character(),
    Embarked = col_character()
  )
)
```

---

## Beyond flat files: labelled data

A lot of data comes in some sort of flat file format, such as `CSV`. In the social sciences, however, we often deal with proprietary file formats, such as *SPSS*'s `.sav` or *Stata*'s `.dta` files.

These files typically include labels. These labels are used to describe variables or variable values.
They are metadata specific to these proprietary file formats.

*If you were able to travel back ten years in time and ask an `R` geek, she'd say that you cannot use labels in `R`. You'd either have to import, e.g., value labels as character strings or use their codes as factors. However, these days...*

---

## Not being able to use labelled data is a thing of the past

Nowadays, if you use the `haven` package, labels are built-in for the corresponding file types. For example:

```r
allbus_2021 <- haven::read_sav("./data/allbus_2021/ZA5280_v1-0-0.sav")

allbus_2021["agec"]
```

```
## # A tibble: 5,342 × 1
## agec
## <dbl+lbl>
## 1 3 [45-59 JAHRE]
## 2 3 [45-59 JAHRE]
## 3 5 [75-89 JAHRE]
## 4 5 [75-89 JAHRE]
## 5 4 [60-74 JAHRE]
## 6 1 [18-29 JAHRE]
## 7 2 [30-44 JAHRE]
## 8 3 [45-59 JAHRE]
## 9 4 [60-74 JAHRE]
## 10 3 [45-59 JAHRE]
## # … with 5,332 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

## Advantages of using labelled data

One could rejoice in not having to use a codebook anymore, just like in *SPSS* (although glimpsing at code output feels much more... data-geeky).

A definite advantage is that you can re-use the labels in figures and plots. Some `R` packages, such as [`sjPlot`](https://strengejacke.github.io/sjPlot/), do that automatically.

In addition, when you exchange your data with colleagues who do not use `R` or when you plan to publish your data (which you always should if that is possible), being able to export data you have manipulated in `R` in different formats is great.
**However, be aware of the missing values hell that you may enter because *Stata* and *SPSS* define missing values differently.**

---

## Getting labels

For variables:

```r
sjlabelled::get_label(allbus_2021$agec)
```

```
## [1] "ALTER: BEFRAGTE(R), KATEGORISIERT"
```

For values:

.tinyish[
```r
sjlabelled::get_labels(allbus_2021$agec)
```

```
## [1] "NICHT GENERIERBAR" "18-29 JAHRE" "30-44 JAHRE" "45-59 JAHRE" "60-74 JAHRE" "75-89 JAHRE" "UEBER 89 JAHRE"
```
]

---

## Setting labels: Variables

```r
allbus_2021$agec <-
  sjlabelled::set_label(allbus_2021$agec, label = "Age, categorized")

sjlabelled::get_label(allbus_2021$agec)
```

```
## [1] "Age, categorized"
```

---

## Setting labels: Values

.tinyish[
```r
allbus_2021$agec <- sjlabelled::set_labels(
  allbus_2021$agec,
  labels = c(
    "18-29 years",
    "30-44 years",
    "45-59 years",
    "60-74 years",
    "75-89 years",
    "Over 89 years"
  )
)

sjlabelled::get_labels(allbus_2021$agec)
```

```
## [1] "18-29 years" "30-44 years" "45-59 years" "60-74 years" "75-89 years" "Over 89 years"
```
]

---

class: center, middle

# [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_1_2_2_Flat_Files.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_1_2_2_Flat_Files.html)

---

## Exporting data

Sometimes our data have to leave `R`, for example, if we...

- share data with colleagues who do not use `R`
- want to continue where we left off
  - particularly if data wrangling took a long time

For such purposes, we also need a way to export our data.

All of the packages we have discussed in this session also have designated functions for that.
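As a minimal sketch of such an export, the following writes a small, made-up data frame (`toy` is hypothetical, not one of the course data sets) to a temporary CSV file with `readr::write_csv()` and reads it back in to check that the values survive the round trip.

```r
library(readr)

# Hypothetical toy data, not one of the course data sets
toy <- data.frame(
  id = c(1, 2, 3),
  group = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so that nothing in the project folder is touched
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read the file back in and compare it with the original values
toy_reimported <- read_csv(path)

all.equal(toy$id, toy_reimported$id)
all.equal(toy$group, toy_reimported$group)
```

Keep in mind that a CSV file only stores plain text, so the column types are guessed again on import.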
<img src="data:image/png;base64,#../img/export_data.png" width="50%" style="display: block; margin: auto;" />

---

## Examples: CSV and Stata files

```r
# write_csv() comes from the readr package
write_csv(titanic, "titanic_own.csv")
```

```r
# write_dta() comes from the haven package
write_dta(titanic, "titanic_own.dta")
```

---

## `R`'s native file formats

If you plan to continue to work with `R` (something we would always recommend 😜), there are at least two native 'file formats' to choose from. The advantage of using them is that they are compressed, so they do not take up unnecessary disk space.

These two formats are `.Rdata`/`.rda` and `.rds`. The key difference between them is that `.rds` can only hold one object, whereas `.Rdata`/`.rda` can also store several objects in one file.

---

## `.Rdata`/`.rda`

Saving

```r
save(mydata, file = "mydata.RData")
```

Loading

```r
load("mydata.RData")
```

---

## `.rds`

Saving

```r
saveRDS(mydata, "mydata.rds")
```

Loading

```r
mydata <- readRDS("mydata.rds")
```

*Note*: A nice property of `saveRDS()` is that it just saves a representation of the object, which means you can name it whatever you want when loading.

---

## Saving just everything

If you have not changed the General Global Options in *RStudio* as suggested in the *Getting Started* session, you may have noticed that, when closing *RStudio*, the program asks you by default whether you want to save the workspace image.

<img src="data:image/png;base64,#../img/save_image.png" width="50%" style="display: block; margin: auto;" />

You can also do that whenever you want using the `save.image()` function:

```r
save.image(file = "my_fancy_workspace.RData")
```

.small[
*Note*: As we've said before, though, this is not something we'd recommend as a workflow. Instead, you should (explicitly and separately) save your `R` scripts and data sets (in appropriate formats).
]

---

## Additional packages

Besides `readr`, `haven` and `readxl`, there are also some other packages that facilitate importing specific data types as tibbles:

- [`sjlabelled`](https://cran.r-project.org/web/packages/sjlabelled/index.html) for labelled data, e.g., from *SPSS* or *Stata*
- [`sf`](https://github.com/r-spatial/sf) for geospatial data

---

## Other packages for data import

For data import (and export) in general, there are even more options, such as...

- `base` R
- the [`foreign` package](https://cran.r-project.org/web/packages/foreign/index.html) for *SPSS* and *Stata* files
- [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) or [`fst`](https://www.fstpackage.org/) for large data sets
- [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) for `.json` files
- [`datapasta`](https://github.com/MilesMcBain/datapasta) for copying and pasting data into tribbles (e.g., from websites, *Excel* or *Word* files)

---

## Reminder regarding file paths

In general, you should avoid using absolute file paths to keep your code reproducible and future-proof. We already talked about this in the introduction, but this is particularly important for importing and exporting data.

As a reminder: Absolute file paths look like this (on different operating systems):

```r
# Windows
load("C:/Users/cool_user/data/fancy_data.Rdata")

# Mac
load("/Users/cool_user/data/fancy_data.Rdata")

# GNU/Linux
load("/home/cool_user/data/fancy_data.Rdata")
```

---

## Use relative paths

Instead of using absolute paths, it is recommended to use relative file paths. The general principle here is to start from the directory in which your current script is located and navigate to your target location.

Say we are in the "C:/Users/cool_user/" location on a Windows machine.
To load our data, we would use:

```r
load("./data/fancy_data.Rdata")
```

If we were in a different folder, e.g., "C:/Users/cool_user/cat_pics/mittens/", we would use:

```r
load("../../data/fancy_data.Rdata")
```

---

class: center, middle

Please first download the [GGSS 2021](https://search.gesis.org/research_data/ZA5280) as `.sav`, `.dta`, and `.csv` files.

# [Exercise](https://stefanjuenger.github.io/r-intro-gesis-2022/exercises/Exercise_1_2_3_Statistical_Software_Files.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://stefanjuenger.github.io/r-intro-gesis-2022/solutions/Exercise_1_2_3_Statistical_Software_Files.html)